LMP: Light-Weighted Memory Protection with Hardware Assistance 


Wei Huang 

wh.huang@mail.utoronto.ca 


Zhen Huang 

z.huang@mail.utoronto.ca 


Dhaval Miyani 

dhaval.miyani@mail.utoronto.ca 


David Lie 

lie@eecg.toronto.edu 


Department of Electrical and Computer Engineering 
University of Toronto 


Abstract 

Despite a long history and many proposals, memory corruption attacks are still viable - a secure and low-overhead defense against return-oriented programming (ROP) continues to elude the security community. Currently proposed solutions must still choose between not fully protecting critical data and relying instead on information hiding, or using incomplete, coarse-grain checking that can be circumvented by a suitably skilled attacker. In this paper, we present a light-weighted memory protection approach (LMP) that uses Intel’s MPX hardware extensions to provide complete, fast ROP protection without having to rely on information hiding. We demonstrate a prototype that defeats ROP attacks while incurring an average runtime overhead of 3.9%.

CCS Concepts 

• Security and privacy → Malware and its mitigation;

1. INTRODUCTION

In languages such as C/C++, the programmer is ultimately responsible for enforcing the memory safety of their programs. Inevitably, programmers produce code with flaws that violate memory safety, and some of these flaws result in memory corruption vulnerabilities that allow attackers to maliciously alter the control flow of programs [29], corrupt critical data [18], or leak sensitive information [12].

Numerous defenses have been proposed or deployed to mitigate memory corruption vulnerabilities. Despite this, memory corruption vulnerabilities continue to be exploitable. For example, ASLR (Address Space Layout Randomization) [27] randomizes the memory locations of code and data segments, but can be circumvented via address space leakage, timing side-channels [19], or attacks such as just-in-time code reuse [32]. DEP (Data Execution Prevention) [24] prevents injecting and executing new code in vulnerable programs. However, it cannot prevent an attacker from reusing existing code in an application via a return-to-libc or ROP (Return-Oriented Programming) attack [29].

To address ROP attacks, Abadi et al. propose Control-Flow Integrity (CFI) [2]. CFI enforces both forward-edge protection (i.e. indirect function calls) and backwards-edge protection (i.e. function returns) to ensure that a memory corruption vulnerability does not allow an attacker to corrupt a code pointer and redirect execution along an edge not specified by the original program. While the target of a forward-edge function call can be resolved statically to a single target or a small number of targets, the target of a backwards-edge function return cannot generally be determined with much precision using only static analysis. As a result, backwards-edge protection generally requires a runtime component. To determine and enforce backwards-edges precisely, shadow stacks are proposed in [2], and software-based fault isolation (SFI) [37] is further used to protect the contents of the shadow stacks from corruption by an attacker. Unfortunately, the runtime overhead of the memory checking required to properly implement this runtime component can be as high as 2x [8].

To reduce this overhead, various proposals weaken the properties of the backwards-edge protection in return for better runtime performance. For example, some propose coarse-grain protections, which do not use a shadow stack to precisely track backwards-edge targets. Since shadow stacks are not used, there is no need for SFI, which avoids the expensive checks required to implement memory protection for the shadow stacks. This coarse-grain approach is taken by proposals such as kBouncer [26], ROPGuard [15], and ROPecker [6], which have significantly lower overheads ranging from 1.59% to 2.60%. These coarse-grain methods are imprecise in that they do not validate that the return address on a backwards-edge actually points to the original caller; instead, they either only check that the return address points to an instruction that follows some call instruction, or they heuristically check the number of returns to detect gadget executions. They have been shown to be circumventable [11,17].

Information hiding is another way to mitigate the overhead of complete CFI backwards-edge protection. In this approach, rather than protecting the data in the shadow stacks with memory access checks, the shadow stacks are placed at a random location in a 64-bit address space. Because the size of the address space is large, it is assumed infeasible for the attacker to guess the location of the shadow stacks. One method called code-pointer integrity (CPI) [22] is able to provide CFI protection with 2.9% overhead (on C applications). However, information hiding techniques can be broken by memory safety vulnerabilities that leak the location of the shadow stacks [14]. Other work has also shown that various side-channel attacks can be used to leak information that reveals the hidden shadow stacks [30,33]. The lesson here is that information hiding is ultimately not equivalent to memory protection: information hiding is vulnerable to address information leakage, while memory protection is not.

In this paper, we propose Light-Weighted Memory Protection (LMP), a new method that leverages Intel’s Memory Protection Extensions (MPX) to make backwards-edge CFI both secure and efficient. LMP tackles two essential problems that stand in the way of memory safety in system software: critical memory region protection in backwards-edge CFI approaches and non-trivial overheads in checking memory access violations.

While hardware-supported memory checks are naturally more efficient than software memory checking, as recent work on using customized hardware for CFI enforcement [7, 10] also shows, we find that hardware extensions like Intel MPX have to be applied carefully. In particular, not all of the operations supported by Intel MPX have low overhead. Therefore, we design LMP to minimize the use of the high-overhead components of MPX while still enabling it to effectively protect shadow stacks from unauthorized modification.

We build a proof-of-concept prototype implementation of LMP and measure the performance overhead with SPEC 2006 benchmarks. The LMP system introduces an average overhead of 3.90%, which is much less than the 2x overhead from the reference implementation of the original CFI [8]. In fact, LMP achieves roughly the same overhead as information hiding techniques [9,22], which generally have about 3% overhead. LMP is also comparable with recent coarse-grained CFI approaches, which have overheads between 1.59% (ROPGuard [15]) and 2.60% (kBouncer [26]). However, LMP provides stronger security guarantees than both information hiding and coarse-grain approaches, as it is not vulnerable to side-channel leakage and enforces a much stricter policy.

We summarize three main contributions this paper makes: 

1. We propose an alternative use of the hardware-assisted pointer checker in Intel MPX that differs from the standard proposed use of MPX.

2. We provide the first stack protection solution that is assisted by the available CPU feature of Intel MPX.

3. We achieve a low overhead among existing equivalent solutions, while providing stronger protection than coarse-grain backwards-edge CFI approaches.

The rest of this paper is organized as follows: we present background on the Intel MPX hardware assistance we depend on in Sec. 2, describe the threat model we assume and the method we use in Sec. 3, give details of the implementation in Sec. 4, evaluate our results in Sec. 5, discuss related work in Sec. 6, and conclude in Sec. 7.


2. BACKGROUND 

Before describing our approach to protection, we first describe the base MPX hardware that LMP leverages. Intel’s Memory Protection Extensions (MPX) are a set of extensions to the x86-64 instruction set architecture in the Intel Skylake processors. The idea of checking pointer references at runtime to prevent illegal memory accesses was previously implemented as the Pointer Checker feature [16] of the Intel compiler for debugging: a pair of bounds is created whenever a pointer is made, and the compiler generates code to check the bounds when the pointer is used. Pointer Checker is fully software-based, while MPX provides hardware acceleration for the bound checks that Pointer Checker would have done in software. MPX has software and hardware components.

For the hardware part, MPX introduces several new reg¬ 
isters and instructions to the instruction set architecture: 

• 4 bound registers: BND0-BND3. Each register is 128 bits wide and stores a 64-bit lower bound address and a 64-bit upper bound address. Bound registers hold the upper and lower bounds that memory accesses are checked against.

• 2 configuration registers: BNDCFGU for user mode and IA32_BNDCFGS for supervisor mode.

• 1 status register: BNDSTATUS, which stores the error code when an exception occurs.

• Bound management instructions: BNDLDX and BNDSTX load and store BND registers from and to a table of object-specific address bounds in memory. BNDMK and BNDMOV allow a programmer to manually manage the BND registers.

• Bound check instructions: BNDCU and BNDCL are used to check that a pointer meets the respective upper and lower bound limits of a specific BND register. If the pointer falls outside of the bounds, then the instruction throws an exception, saving the need for an instruction to explicitly check the result of the comparison.

For the software part, MPX requires the following system software support:

• MPX-enabled Compiler: The compiler is responsible for inserting bound checks before pointer dereferences. Because bound information must be loaded into one of a limited number of BND registers before it can be used to check a pointer, the compiler must also load and spill bounds information between the BND registers and memory. For now, Intel has added MPX support to the GCC main branch since version 5.0, for C/C++ and x86 targets only.

• MPX Runtime: The MPX runtime library is linked against the program at compile-time. The library provides an API that the application developer can use to configure MPX hardware features, as well as functions to help compiler-generated code manage MPX registers.

• Operating system: The OS, together with the compiler, needs to support the new instructions dedicated to MPX. If a bound check instruction fails, the OS must catch the generated exception and signal the application.



Figure 1: An example of how MPX works. 


We now give an example of how these MPX components 
can be used to bound-check a small program. Consider a 
program that declares and manipulates data in 5 arrays: 

int A[10], B[20], C[30], D[40], E[50];

Anytime a pointer pointing into one of these arrays is dereferenced, the MPX compiler needs to insert bound checks to ensure that the pointer falls within one of these arrays. To do this, the MPX compiler needs to determine which array the pointer should be pointing into, load the upper and lower bounds of the array into a BND register, and then insert the appropriate BNDCU and BNDCL checks before the pointer dereference to check it against the upper and lower bounds of the array. For example, as shown in Figure 1, if array A is stored at addresses 0x7ffffba0ac70-0x7ffffba0ac94, the MPX compiler must first load the lower and upper bound addresses 0x7ffffba0ac70 and 0x7ffffba0ac94 into one of the bound registers (say BND0). This is done using the BNDLDX instruction, which loads the bound information from the bound directory in memory into the appropriate register. Then the MPX compiler inserts bound checking instructions to compare the dereferenced pointer with the bound values in BND0. If the dereference falls outside the bounds, a #BR exception is generated by the hardware and caught by the exception handler in the MPX runtime.
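The check sequence above can be modelled in plain C. The sketch below is a software analogue only, not MPX code: the type `bnd_reg` and the helpers `bnd_make`, `bnd_check_lower`, `bnd_check_upper`, and `bnd_in_bounds` are hypothetical names that mimic the roles of BNDMK, BNDCL, and BNDCU, bounds are assumed to be inclusive byte addresses, and a failed check returns false where the hardware would raise #BR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Software stand-in for one 128-bit bound register (hypothetical type). */
struct bnd_reg {
    uintptr_t lower; /* lowest valid byte address  */
    uintptr_t upper; /* highest valid byte address */
};

/* Analogue of BNDMK: bounds covering `size` bytes starting at `base`. */
static struct bnd_reg bnd_make(const void *base, size_t size)
{
    assert(size > 0);
    struct bnd_reg b = { (uintptr_t)base, (uintptr_t)base + size - 1 };
    return b;
}

/* Analogue of BNDCL: passes when the address is not below the lower bound. */
static bool bnd_check_lower(struct bnd_reg b, const void *p)
{
    return (uintptr_t)p >= b.lower;
}

/* Analogue of BNDCU: passes when the address is not above the upper bound. */
static bool bnd_check_upper(struct bnd_reg b, const void *p)
{
    return (uintptr_t)p <= b.upper;
}

/* A dereference is permitted only if both checks pass. */
static bool bnd_in_bounds(struct bnd_reg b, const void *p)
{
    return bnd_check_lower(b, p) && bnd_check_upper(b, p);
}
```

With bounds made from `int A[10]`, a pointer to `A[9]` passes both checks, while the one-past-the-end address fails the upper check.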

For a pointer into an array to be bound-checked, the bounds for that array must be loaded into a BND register. Since the arrays A, B, C, D and E are all located in different regions in memory, the MPX compiler must load the appropriate array bounds into a BND register whenever a pointer is used to dereference a location in a different array. Because there are 5 arrays but only 4 BND registers, it is impossible for the compiler to keep the bounds for all the arrays in BND registers all the time. This results in many BNDLDX and BNDSTX instructions being generated by the compiler to load and spill the bounds information from and to memory.
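The register pressure described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of GCC's actual bounds allocation: the function `count_bound_loads` is hypothetical, arrays are identified by small non-negative integers, and least-recently-used replacement is assumed for choosing which of the 4 registers to overwrite.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BND_REGS 4 /* BND0-BND3 */

/* Count how many bound reloads (modelled BNDLDX executions) a sequence of
 * array accesses needs when only NUM_BND_REGS bounds can be resident.
 * Replacement policy: least recently used (an assumption of this sketch). */
static size_t count_bound_loads(const int *accesses, size_t n)
{
    int resident[NUM_BND_REGS];    /* array id held by each register, -1 = empty */
    size_t last_use[NUM_BND_REGS]; /* time of most recent use, 0 = never used */
    size_t loads = 0;

    for (int r = 0; r < NUM_BND_REGS; r++) {
        resident[r] = -1;
        last_use[r] = 0;
    }

    for (size_t t = 0; t < n; t++) {
        int hit = -1;
        int victim = 0;
        for (int r = 0; r < NUM_BND_REGS; r++) {
            if (resident[r] == accesses[t])
                hit = r;
            if (last_use[r] < last_use[victim])
                victim = r;
        }
        if (hit < 0) { /* bounds not resident: one reload is needed */
            hit = victim;
            resident[hit] = accesses[t];
            loads++;
        }
        last_use[hit] = t + 1;
    }
    return loads;
}
```

Under these assumptions, cycling over 5 arrays reloads bounds on every access, whereas cycling over 4 arrays needs only the 4 initial loads.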


The bound checking instructions (BNDCU and BNDCL) have very low execution cost. However, because the BNDSTX and BNDLDX instructions have to access the 2-layer bound tables stored in main memory, they are very slow compared to the bound checking instructions. To measure this cost, we ran an experiment comparing BNDCU with BNDSTX/BNDLDX. We randomly generate 1000 memory addresses and use an address lower than all of them to perform 1000 BNDCU instructions, making sure there are no bound violations. Then we use BNDSTX to store the first 500 addresses into the bound tables, and load them all back one by one into the bound register BND0. The results of this experiment show that the bound checking instruction, BNDCU, has almost the same execution time as a NOP instruction (1000 instructions in 0.45ms), while the bound store+load instructions BNDSTX/BNDLDX cost almost 1000x more than NOP (1000 instructions in 432ms).
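As a quick check of the "almost 1000x" figure, the two reported totals can be converted to per-instruction costs. The arithmetic below uses only the measurements quoted above and takes the BNDCU time as the stand-in for NOP.

```latex
\[
t_{\mathrm{BNDCU}} = \frac{0.45\ \mathrm{ms}}{1000} = 0.45\ \mu\mathrm{s},
\qquad
t_{\mathrm{BNDSTX/BNDLDX}} = \frac{432\ \mathrm{ms}}{1000} = 432\ \mu\mathrm{s},
\qquad
\frac{432\ \mathrm{ms}}{0.45\ \mathrm{ms}} = 960
\]
```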

With real applications, the number of objects in the bound table can become quite large. However, as the number of BND registers is fixed at 4 in the hardware architecture, this causes heavy use of the BNDSTX and BNDLDX instructions, resulting in high overhead. With a recent MPX-enabled GCC (version 6.1) as a reference implementation of an MPX compiler, the runtime overhead on SPEC 2006 benchmarks can be as large as 2x to 4x. This indicates that the number of BNDSTX and BNDLDX instructions must be minimized to ensure low overheads. Minimizing them is one of the main reasons LMP, whose design we describe in the next section, is able to provide low overhead.

3. METHODOLOGY 
3.1 Threat Model 

We assume a realistic attacker can exploit a memory corruption vulnerability to change arbitrary memory locations (so long as they are permitted by the hardware) to values of their choosing. We also assume that the attacker is aware of the address locations of key data structures such as pointers, stacks and meta-data and can arbitrarily target them with the memory corruption vulnerability. We assume the goal of the attacker is to corrupt a code pointer to compromise the control-flow integrity of a program.

Despite this powerful attacker, we do assume that the attacker is limited in some realistic ways. For example, the attacker cannot directly modify registers in CPUs or change any memory that is marked read-only, such as the code pages, as both would allow the attacker to remove or bypass the compiler-inserted instrumentation that LMP uses. The attacker also cannot compromise the integrity of the target program before it is loaded into memory, which means that attacks on the program loader and operating system are out of scope for LMP. LMP is intended to mitigate the exploitation of memory corruption vulnerabilities by remote or unprivileged attackers for the purposes of privilege escalation.

In general, there are two types of code pointers that need to be protected: function pointers (i.e. forward-edge) and return addresses (i.e. backwards-edge). LMP focuses on protecting against attacks on return addresses and assumes use of an existing forward-edge CFI protection scheme to protect function pointers from being corrupted. There is a rich body of literature addressing the problem of forward-edge protection. For example, C++ virtual calls, which are indirect control transfers through VTables, can be hijacked by attackers [5] to redirect execution to malicious code. This type of protection can be attained with low overhead by previous work, such as VTV [34], VTable Interleaving [3] and VTrust [40]. Our LMP system can work together with current forward-edge CFI defenses to provide full CFI protection.

3.2 Memory Protection with MPX 

LMP uses two components to protect return addresses: the shadow stacks and the protected memory region allocator. First, standard shadow stacks are used to maintain a second copy of return addresses. The shadow stack is updated on a function call and checked when functions return. An attacker would have to corrupt both the return address on the program stack and its copy on the shadow stack to successfully corrupt a return address. Thus, to prevent the attacker from corrupting the shadow stack, MPX instructions are inserted by LMP to ensure that only the instructions inserted by LMP at function calls to update the shadow stack can write to the shadow stack.

Based on the threat model described in Sec. 3.1, only store operations could modify the shadow stack area, and the code pages are read-only, so an attacker could not remove the bound checks on store operations. An attacker could try to jump directly to a store instruction and avoid executing the bound checks, but to do this, the attacker would have to corrupt a code pointer, which the CFI provided by LMP and a complementary forward-edge CFI scheme prevents. Thus, the backwards-edge protection LMP provides hinges on the ability to protect the shadow stacks from corruption by a memory safety vulnerability.

To protect the shadow stack, we instrument each store instruction in the program to make sure that it cannot access the memory region of the shadow stacks even if the attacker has modified the effective address that the instruction targets. Despite there being many store instructions in the program, they are all checked against the same bounds, as LMP need only check that they do not target the shadow stack. This is efficient because it avoids the need to use the expensive BNDLDX and BNDSTX instructions to modify the bounds that LMP must check - LMP simply sets the lower and upper bounds of a BND register to the lower and upper limits of the shadow stack and instruments each store in the program to ensure that it does not fall within that region. However, in multi-threaded programs, there will be one shadow stack for each thread. A naive solution would use a different BND register to store the upper and lower addresses for each stack, but this would require the expensive BNDLDX and BNDSTX instructions to load and store the stack bounds into the BND registers, hurting performance. Instead, we observe that all shadow stacks are in the same protection class - i.e. regardless of which thread a store is executing in, it should not be able to access any of the shadow stacks. This means that all shadow stacks can be placed in a contiguous region of memory and protected with a single BND register. Thus, the other component of LMP is a scheme that allocates standard shadow stacks so that they are in a single contiguous region of memory. In the same way, all other auxiliary data structures that LMP employs are also protected from modification by being allocated in the protected region that is restricted by MPX instructions.
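The policy that every ordinary store obeys can be stated as a single predicate over one pair of bounds. The C sketch below models only that policy; it does not show how the bounds are encoded in a BND register, the names `protected_region` and `store_allowed` are hypothetical, and only the first byte of a store is checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor for the single contiguous protected region that
 * holds every thread's shadow stack. */
struct protected_region {
    uintptr_t lower; /* first byte of the region */
    uintptr_t upper; /* last byte of the region  */
};

/* Policy applied before every ordinary store: the target address must lie
 * outside the protected region, whichever thread executes the store. */
static bool store_allowed(const struct protected_region *r, uintptr_t target)
{
    assert(r->lower <= r->upper);
    return target < r->lower || target > r->upper;
}
```

Because all shadow stacks share the one region, a store aimed at any thread's shadow stack is rejected by the same comparison, with no per-thread bounds to reload.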

3.3 Using the Shadow Stack 

In order to restrict return instructions, LMP records the return address in the shadow stack upon each function call, where it is protected from corruption by an attacker. We illustrate the shadow stack layout of the LMP system in Figure 2.

One difference from other shadow stack approaches is that LMP compares the function return address with the one stored in the shadow stack using MPX bound checking instructions. This avoids the overhead of the compare/branch instructions in a standard shadow stack implementation; details are presented later in this section.

As mentioned earlier, the shadow stacks are all located in a contiguous region of memory. Moreover, this region is statically defined at program startup, and since it is inaccessible to any memory instruction other than the shadow stack operations inserted by LMP, the region cannot be used to store any type of data other than shadow stacks. The main difference between our shadow stack implementation and other shadow stack or safe stack implementations [9] is that LMP is not free to place shadow stacks at any location or offset-based region for convenience, but must instead place them in the predefined shadow stack region. Since each thread must have its own shadow stack, we must define a mapping function that allows the shadow stack code to find the shadow stack for any given thread, but also maps each shadow stack into the predefined region.

One option is to make the predefined region as large as the region where regular stacks can be allocated. This would be efficient as each shadow stack could then be located at a fixed offset from the thread’s regular stack. However, the pthread interface permits stacks to be created anywhere in a process’ virtual address space. As a result, we would have to reserve one half of the virtual address space for the predefined region. While this is likely acceptable in most cases for 64-bit code, it can present problems if processes need to allocate memory at a particular virtual address.

Figure 2: Illustration of LMP shadow stacks.

Instead, a more costly but flexible alternative is to dynamically allocate and map stack space from the predefined region as threads and their corresponding stacks are created. While this might be slightly more expensive than the fixed-offset approach, we show that it is still practical, and can serve as a conservative estimate for the performance overhead of LMP. LMP uses a mapping table that stores the offset between a thread’s regular stack and its corresponding shadow stack. The predefined region is partitioned into several fixed-sized shadow stacks, and another table records which shadow stacks are in use and which are free. When a thread is created, LMP finds an unallocated shadow stack and updates the mapping table with the offset between the thread’s regular stack and its newly allocated shadow stack. When a thread is destroyed, its shadow stack is deallocated and the offset in the table is cleared. These allocation and deallocation operations only occur during thread creation and destruction.
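The two tables described above can be sketched in C as follows. This is an illustrative model rather than the prototype's code: the names `shadow_region`, `shadow_alloc`, `shadow_free`, and `shadow_addr` are hypothetical, the slot count is arbitrary, and offsets are kept as unsigned values so that wraparound arithmetic stays well-defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_THREADS 8       /* illustrative slot count */
#define SLOT_SIZE   0x8000u /* fixed size of each shadow stack (32 KB) */

struct shadow_region {
    uintptr_t base;                /* start of the predefined region */
    bool      in_use[MAX_THREADS]; /* which fixed-size slots are allocated */
    uintptr_t offset[MAX_THREADS]; /* mapping table: shadow base minus regular
                                      base, stored modulo the pointer width */
};

/* Thread creation: claim a free slot and record the offset between the
 * thread's regular stack and its shadow stack. Returns the slot index,
 * or -1 if the region is full. */
static int shadow_alloc(struct shadow_region *r, uintptr_t regular_base)
{
    for (int i = 0; i < MAX_THREADS; i++) {
        if (!r->in_use[i]) {
            uintptr_t shadow_base = r->base + (uintptr_t)i * SLOT_SIZE;
            r->in_use[i] = true;
            r->offset[i] = shadow_base - regular_base;
            return i;
        }
    }
    return -1;
}

/* Thread destruction: release the slot and clear its table entry. */
static void shadow_free(struct shadow_region *r, int slot)
{
    assert(slot >= 0 && slot < MAX_THREADS);
    r->in_use[slot] = false;
    r->offset[slot] = 0;
}

/* Translate an address on the regular stack to its shadow counterpart. */
static uintptr_t shadow_addr(const struct shadow_region *r, int slot,
                             uintptr_t regular_addr)
{
    assert(slot >= 0 && slot < MAX_THREADS && r->in_use[slot]);
    return regular_addr + r->offset[slot];
}
```

Only `shadow_alloc` and `shadow_free` touch the tables, matching the observation that allocation work happens solely at thread creation and destruction; the per-call path needs just one table lookup and an addition.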

LMP inserts instrumentation on function entry that stores the return address into the shadow stack. Because this memory operation is inserted by LMP, it need not be bound-checked. At function return, LMP inserts instrumentation that finds the corresponding return address in the shadow stack and compares it against the address that control flow is going to. In this way, the shadow stack ensures that when execution returns, the return address has not been tampered with. A thread’s regular and shadow stack have the same layout, so a return address on the regular stack has the same offset from the base of the stack as the corresponding return address has from the base of the shadow stack. Thus, only the offset between the regular stack base and the shadow stack base needs to be stored in the mapping table. This design is different from the related work [1], which is also a compiler-based approach, in that when a call rewind happens, there is no need to pop the shadow stack to find a match. Both the function entry and function return instrumentation use the mapping table to find the corresponding shadow stack for the thread.

    PUSH %rsp
    CALL _map_table
    # find shadow stack via mapping table
    # return shadow stack address in %rax
    MOV (%rsp), %rdx
    MOV %rdx, (%rax)
    # copy ret addr to shadow stack address in %rax

    (FUNCTION CALL BODY)

    MOV (%rsp), %rdx
    # put function return address in %rdx
    BNDMK %bnd0, [(%rax), 0]
    # put the address in shadow stack in a bnd
    # register %bnd0
    BNDCU %rdx, %bnd0
    BNDCL %rdx, %bnd0
    # check return address with the one in shadow
    # stack

Figure 3: Assembly code example for instrumented function entry/exit.
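The mirrored layout can be illustrated with a small C model in which both stacks are arrays indexed by the same slot number. This is a sketch only: the names `stacks`, `on_entry`, and `on_return` are hypothetical, slot indices stand in for byte offsets from the stack base, and the comparison is written as an ordinary equality test rather than with MPX instructions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_SLOTS 64

struct stacks {
    uintptr_t regular[STACK_SLOTS]; /* ordinary call stack, attacker-writable */
    uintptr_t shadow[STACK_SLOTS];  /* protected copy with the same layout */
};

/* Function entry: mirror the return address at the same offset. */
static void on_entry(struct stacks *s, size_t slot, uintptr_t ret_addr)
{
    assert(slot < STACK_SLOTS);
    s->regular[slot] = ret_addr;
    s->shadow[slot]  = ret_addr;
}

/* Function return: the address about to be used must equal the shadow copy
 * found at the same offset. No popping or searching is involved. */
static bool on_return(const struct stacks *s, size_t slot)
{
    assert(slot < STACK_SLOTS);
    return s->regular[slot] == s->shadow[slot];
}
```

Because the lookup is by offset, returning directly to an outer frame after several inner frames were skipped still finds the matching entry, while an overwritten return address on the regular stack no longer matches its shadow copy.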

We give an example of the execution sequence, in steps, after code instrumentation for shadow stack operations, with an assembly code snippet in Figure 3:

1. On function entry:

(1) prepare shadow stack address in register %rax

(2) copy return address at (%rsp) to shadow stack

2. Execute function call and body

3. On function return:

(1) copy return address in shadow stack to bound register %bnd0

(2) use bound checking instructions to check the return address at (%rsp) against %bnd0

Figure 4: A flow chart of how the LMP system works.

We use the MPX bound checking instructions BNDCL and BNDCU instead of a series of compare and jump instructions to do the equality comparison. We set the return address in the shadow stack as both the upper and lower bound in the bound register (BND0), then bound-check the function return address against it. Using MPX instructions to check the return address improves performance the same way the MPX instructions improve memory bound checks - the MPX instructions avoid the extra branch and check instructions that would normally be needed to check the result of the comparison. Instead, MPX instructions will throw an exception if the check fails.
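The reason two bound checks amount to an equality test is that the bounds are degenerate: when lower and upper are the same value, only that value satisfies both. The C sketch below shows the logic in software; `ret_bounds`, `bounds_from_saved`, and `return_address_ok` are hypothetical names, and a false result corresponds to the exception the hardware would raise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Degenerate bounds: lower == upper == the saved return address. */
struct ret_bounds {
    uintptr_t lower;
    uintptr_t upper;
};

static struct ret_bounds bounds_from_saved(uintptr_t saved_ret)
{
    struct ret_bounds b = { saved_ret, saved_ret };
    return b;
}

/* Passing both the lower check (ret >= lower) and the upper check
 * (ret <= upper) is equivalent to ret == saved_ret. */
static bool return_address_ok(struct ret_bounds b, uintptr_t ret)
{
    bool lower_ok = ret >= b.lower; /* role of BNDCL */
    bool upper_ok = ret <= b.upper; /* role of BNDCU */
    return lower_ok && upper_ok;
}
```

Any return address that differs from the saved one, in either direction, fails exactly one of the two checks.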

3.4 Execute a Program with LMP 

We illustrate the conceptual design of our LMP system with a simple example of how it works with a user program, as shown in Figure 4.

The LMP-enabled compiler instruments the application source code at compile-time. When the program starts, the LMP runtime prepares the shadow stack memory region and stores its lower boundary and upper boundary to the bound register BND1. This protects the shadow stack from any illegal modification. While the program is running, it stores return addresses to the shadow stack when a function call happens and the return address is pushed to the normal call stack. When the function returns, the two addresses stored in the normal stack and in the shadow stack are compared. Throughout the program, whenever there is a memory operation that stores values to a memory address, we instrument the code to verify that the address is not in the range of the shadow stack using bound checking instructions.

Under certain special cases, such as C++ exception handling, the call stack will unwind due to setjmp/longjmp, causing function calls and returns to mismatch. In the method we propose with LMP, as long as the compiler does not change the original call stack with exception information (e.g., GCC stores it in a separate side-table), the return addresses in the original call stacks and in the shadow stacks sit at the same offset from the stack top addresses, so stack unwinding by exception handling operations is not an issue.

The LMP approach can potentially be extended to provide backwards-edge protection for binary-only CFI. With a control-flow graph (CFG) generated through disassembly analysis of a binary, and some changes to pthread library functions, the LMP system can also work with binary-only CFI approaches by applying binary re-writing techniques.

4. IMPLEMENTATION 

The LMP system has two main parts: the LMP-enabled compiler and the LMP runtime library. For the compiler part, we modify the RTL passes to instrument boundary checking, ensuring that there can be no unauthorized writes to the memory region where the shadow stacks are stored. The LMP runtime is responsible for managing the allocation of shadow stacks and the storing of return addresses from function call stacks.

4.1 LMP-enabled Compiler 

The implementation of the LMP-enabled compiler is based on GCC 5.2.0, with approximately 600 lines of code modified/added to the RTL passes. The main reason for modifying the compiler and adding new RTL passes is to do code instrumentation at the assembly level. Both shadow stack operations and code to protect the shadow stack memory region from being modified are instrumented by the LMP compiler.

In the GCC RTL passes, we modify the source code in final.c and insn-output.c, which take care of assembler code output for functions. Among them, final_end_function() helps emit assembly code at function exit; we add our code here to do instrumentation for shadow stack operations.

To implement shadow stacks, at each function call, when the return address is pushed, the compiler instruments the code to obtain that address and to call gettid(); the thread then looks up its offset via the LMP runtime and stores the return address to the shadow stack. At first, it might seem that a call to gettid() would be overly expensive, but such operations are highly optimized and our measurements show that the cost is negligible. At each return instruction, the compiler instruments the code to get the ThreadID and ask the LMP runtime for the return address stored in the shadow stack. If the address used by the return instruction does not match the one in the shadow stack, a bound violation message is sent to the LMP runtime. In the GCC passes, we identify function calls by looking for the RTL expression code call_insn, with the format:

(call (mem:fm addr) nbytes)

where addr is the address of the subroutine.

























Before:

    4007b5: ADD $0xc,(%rax)

After:

    4005e1: ADD $0xc,(%rax)
    4005e5: BNDCU %rax,%bnd1
    4005ea: BNDCL %rax,%bnd1

Figure 5: An example of LMP instrumentation for store instruction.


For bound checking of memory operations, we change the RTL passes of GCC to find RTL expressions containing memory operations that store values to a main memory address. The address is compared with the lower and upper boundary addresses of the memory region where the shadow stacks reside, which are stored in the bound register BND1. A bound violation is triggered if the address falls into the memory range of the shadow stack, which means the pointer that the memory store uses as its target has likely been corrupted by an attacker.

We give an example of the code instrumentation results in Figure 5 to show the assembly code before and after instrumentation. The add instruction writes to main memory, and the instrumented bndcu and bndcl instructions check whether the memory address to be changed is within the protected shadow stack region.

4.2 LMP Runtime 

The LMP runtime is implemented in approximately 700 lines of
C source code. As this is a proof-of-concept prototype, we
allocate a 2GB virtual memory region for the shadow stacks.
We chose this size because the OS in our test environment
allows at most 62057 threads (as reported by

$ cat /proc/sys/kernel/threads-max), and we give each
possible thread a 32KB shadow stack, which we believe is more
than enough, as the benchmarks we used never exceed 8KB of
call stack per thread. In our implementation, both the
maximum number of threads and the space for each shadow stack
are tunable. Since the shadow stacks are allocated in the
64-bit virtual address space, they take up only a tiny
fraction of it. Also, because most of the shadow stacks may
never be written to, they consume only virtual address space,
and the operating system never needs to allocate physical
memory to back them.
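
The sizing argument above can be checked with a short sketch:
62057 threads times 32KB is 2,033,483,776 bytes, or roughly 1.9GB,
which is where the 2GB figure comes from. The code below is an
illustration under stated assumptions, not the LMP runtime itself.
The helper names (lmp_region_bytes, lmp_reserve_region,
lmp_stack_base) are hypothetical, and reserving the region with
mmap and MAP_NORESERVE on Linux is one plausible way to obtain
address space that is backed by physical memory only when touched;
the paper does not say which call the prototype uses.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Total bytes needed for one fixed-size shadow stack per thread. */
size_t lmp_region_bytes(size_t max_threads, size_t stack_bytes)
{
    return max_threads * stack_bytes;
}

/* Assumption: reserve address space lazily; pages are only backed
 * by physical memory once they are written. NULL on failure. */
void *lmp_reserve_region(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Base address of the shadow stack assigned to thread slot `slot`. */
void *lmp_stack_base(void *region, size_t stack_bytes, size_t slot)
{
    return (uint8_t *)region + slot * stack_bytes;
}
```

With the values from the text, lmp_region_bytes(62057, 32 * 1024)
yields the region size, and lmp_stack_base locates the stack for a
given thread slot inside the reserved region.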

We could have also dynamically allocated shadow stacks 
in memory, which would allow the shadow stack region to be 
dynamically extended and reduced in size to accommodate 
growth and reduction in shadow stack usage. This would 
likely add some overhead in exchange for better virtual ad¬ 
dress space utilization. However, given that virtual address 
space is generally not a limiting factor on 64-bit architec¬ 
tures, we do not believe that this extra overhead is justified. 

Figure 6: LMP overhead by comparison of execution time
between baseline and LMP.

When the instrumented program needs the LMP runtime to store
a function return address on the shadow stack, it passes the
return address, the offset between the base of the call stack
and the location that holds the return address, and the
ThreadID to the function LMP_push_ss(return_addr, offset,
threadID). The runtime then finds the shadow stack prepared
for that thread and stores the function return address in it.
When a function returns and its return address needs to be
compared with the one stored in the shadow stack, the program
calculates the same offset and calls
return_addr = LMP_pop_ss(offset, threadID), and the LMP
runtime returns the return address stored in the shadow
stack.
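
A minimal sketch of this push/pop interface is given below. Only
the names LMP_push_ss and LMP_pop_ss and their argument lists come
from the text; everything else is an assumption made for
illustration. In particular, the sketch assumes that the byte
offset from the call-stack base directly selects the slot in the
per-thread shadow stack, it uses a small static table of 4 thread
slots instead of the full region, and the helper lmp_return_ok is
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Sizes are assumptions for the sketch: 32KB per stack as in the
 * text, but only 4 thread slots instead of the real maximum. */
enum { SS_BYTES = 32 * 1024, SS_THREADS = 4 };
enum { SS_SLOTS = SS_BYTES / sizeof(uintptr_t) };

static uintptr_t shadow[SS_THREADS][SS_SLOTS];

/* Map (offset, threadID) to a slot; NULL when out of range. */
static uintptr_t *ss_slot(size_t offset, unsigned threadID)
{
    if (threadID >= SS_THREADS || offset >= SS_BYTES)
        return NULL;
    return &shadow[threadID][offset / sizeof(uintptr_t)];
}

/* Store a return address; returns 0 on success, -1 on bad input. */
int LMP_push_ss(uintptr_t return_addr, size_t offset, unsigned threadID)
{
    uintptr_t *slot = ss_slot(offset, threadID);
    if (slot == NULL)
        return -1;
    *slot = return_addr;
    return 0;
}

/* Fetch the stored return address; 0 when the slot is out of range. */
uintptr_t LMP_pop_ss(size_t offset, unsigned threadID)
{
    uintptr_t *slot = ss_slot(offset, threadID);
    return slot ? *slot : 0;
}

/* Hypothetical check done before a return: does the address on the
 * call stack match the copy kept in the shadow stack? */
int lmp_return_ok(uintptr_t addr_on_stack, size_t offset, unsigned threadID)
{
    return LMP_pop_ss(offset, threadID) == addr_on_stack;
}
```

In this model, a mismatch reported by lmp_return_ok corresponds to
the case where the instrumented code would report a violation to
the runtime.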


5. EVALUATION 

In this section we evaluate the effectiveness of our LMP
system and several aspects of its overhead. We run our
experiments on an Intel i5-6600K with 4 cores at 3.5GHz in
64-bit mode with 8GB of RAM. The benchmarks are run on Fedora
22 with Linux kernel 4.1.7.

5.1 Performance Overhead 

We evaluate the overheads of the LMP system using the CINT
2006 benchmarks. All results are averages over five runs
gathered in the non-reportable mode of the SPEC benchmark
suite. We compare the results with a baseline that does not
apply LMP. As shown in Figure 6, the average performance
overhead of LMP in comparison to the baseline is 3.90%. The
h264ref benchmark has the highest overhead of 12.55%, mainly
because it executes many more function calls and RET
instructions than the others. Without the h264ref benchmark,
the average overhead is only 2.12%.

To identify the main sources of overhead introduced by the
LMP system, we further separate it into three parts: context
settings, bound checking and shadow stack operations. Context
settings include the runtime library initialization,
retrieving the ThreadID via system calls, etc. Bound checking
covers the time spent in MPX bound instructions. Shadow stack
operations consist of all operations dealing with the shadow
stacks.

We measure how much each component contributes to the overall
overhead by removing the other two components and measuring
the overhead with only one component added to each benchmark.
Over all the CINT 2006 benchmark results, the average
overhead of context settings is 0.1%, of bound checking is
0.52% and of shadow stack operations is 3.27%. Figure 7 shows
that context settings and bound checking contribute an almost
negligible amount of overhead. Shadow stack operations are
the main contributor, accounting on average for 84% of the
total overhead. The performance penalty of the memory
protection is only 15% of the overall overhead, and the
remaining 1% can be attributed to infrequent setup and stack
allocation/deallocation operations. These results are in line
with other heavily optimized shadow stack implementations
[9], which report performance overheads between 2% and 10%
for several shadow stack variants on the same benchmark set.
As a result, we believe this overhead is representative of
the costs of LMP on current processors.

Figure 7: Overhead components of LMP.

Figure 8: Code Expansion of LMP.

5.2 Code Expansion 

The LMP-enabled GCC emits assembly code to instrument the
target program in the RTL passes, so there is an increase in
code size. We directly compare the sizes of the binaries of
each benchmark and calculate the percentage of code expansion
that LMP introduces.

Figure 8 shows that across the 9 benchmarks we have run, the
code at the assembly level expands by 39.27% on average.
There is some variance among the code expansion numbers of
the benchmarks. Most of the expansion comes from the bound
checking instructions: the more function calls/returns and
memory store instructions a benchmark contains, the more
bound checking instructions are inserted. Note that since
this is a prototype implementation, the instrumentation also
contains debugging code that is not executed in normal cases,
and due to development time limits we did not remove all of
it for the evaluation.

5.3 Memory Overhead 

The memory overhead introduced to the benchmarks is 19.3MB
per program on average, and the average increase in maximum
resident memory is 9.73%. The memory overhead comes mainly
from the runtime library part of the LMP system, which
manages the shadow stacks. As mentioned in Sec. 4, the memory
allocation is not optimized in this research prototype, so
there is certainly much room for improvement. We expect that
the memory overhead could be decreased significantly by
dynamically allocating the mapping table as needed instead of
pre-emptively allocating it for the maximum number of
threads.

6. RELATED WORK 

We review literature in the area of defense technologies to 
protect programs from control flow hijacking attacks. 

Traditional attacks based on stack smashing and code
injection [28] can be prevented by the now widely adopted
data execution prevention (DEP) [24]. Hardware support for
DEP is commonly available as the no-execute bit (NX bit, also
called the XD or XN bit depending on the processor
architecture), which ensures that code in the data segment
cannot be executed.

To counter the protection above, attackers have developed
more sophisticated methods that do not rely on injecting new
code and instead reuse existing code in the program. One of
the early examples is the return-into-libc attack [35], which
redirects program execution flow through libc functions.
Similar exploits such as the return-oriented programming
(ROP) attack [29] can also execute arbitrary computations by
chaining pieces of existing code after changing a return
address on the function call stack. Both are considered to be
Turing-complete.

Randomization is a practical way to hide information about
the memory layout of a program from attackers. Address Space
Layout Randomization (ASLR) [27] was proposed to defend
against ROP attacks by mapping program processes and dynamic
libraries into a random virtual address space every time.
Address Space Layout Permutation (ASLP) [21] further
re-orders subroutines in the code segments on top of the
randomization provided by ASLR. However, the implementations
of ASLR were soon found to be ineffective against a
de-randomization attack [31] that needs only a few hundred
seconds of additional time to compromise the target program,
and ASLP is vulnerable too [23].

CFI (Control Flow Integrity) [2] was introduced to guarantee
that indirect control-flow transfers point to legitimate
locations. To ensure that the return addresses in function
call stacks are not tampered with, shadow stacks that store
copies of return addresses were suggested. However, the
performance overhead of the original CFI is reported to be as
high as 2x if the exact policy is enforced, so variants of
coarse-grained CFI have been proposed with changes to the
original policy. kBouncer [26] uses the Last Branch Record
(LBR) x86 registers, which store the recent branches that the
CPU executed. It validates whether the return address points
to an instruction that follows a call instruction, so the
procedure is actually a heuristic mitigation of ROP attacks.
Using the same LBR registers and a policy similar to
kBouncer, ROPecker [6] adds static analysis to speculate on
the future execution of a program in order to keep ROP
gadgets from running. Unfortunately, it can be bypassed as
well [11]. ROPGuard [15] proposes to check whether the stack
pointer points to a memory address outside of the stack, so
that the system does not allow ROP attackers to execute
payloads on the heap. However, adversaries can still modify
the stack pointer before the target function is called. The
above defenses are also vulnerable to attacks that leverage
hooks and hide malicious code within non-control data [36] if
the critical memory region is not protected at runtime. O-CFI
[25] explores a randomization approach to conceal the program
control-flow graph and applies MPX bound checking to guard
branch instructions. However, it is still a coarse-grained
CFI method and provides only probabilistic security
guarantees, since it does not fully protect function return
addresses. Our LMP approach adheres to the original CFI
policy for backward-edge protection, i.e., it checks every
function return address and ensures that the return address
points to the function caller.

For forward-edge CFI protection, the paper that proposes VTV
[34] finds that more than 90% of indirect calls are virtual
calls. Their method aims at protecting VTables from being
hijacked by validating at runtime, before a virtual method
call is made, that the target VTable is in a legitimate set.
The performance of VTV depends on the size of the legitimate
VTable set, so the complexity of the C++ class hierarchy
affects the overhead. Building on this idea, VTrust [40] and
VTable Interleaving [3] improve on the performance of VTV
without needing the global class hierarchy, and prevent
VTable hijacking attacks. Our LMP system does not itself
provide forward-edge CFI protection, because it can easily be
combined with the approaches mentioned above by applying
their patches to the LMP-enabled compiler, which makes full
CFI protection possible.

CFI variants have also been proposed with different security
targets. The techniques of the original CFI have been used to
enforce software-based fault isolation (SFI) [39]. XFI [13]
also employs CFI policies, with the help of debugging
information in Windows PDB files, to defend against ROP
attacks. Data-flow Integrity (DFI) [4] follows the CFI
approach to prevent non-control data attacks. Hypersafe [38]
is similar to fine-grained CFI protection. It has a target
table for indirect branches and aims at protecting the
control-flow integrity of the hypervisor.

Code-Pointer Integrity (CPI) [22] explores a security
mechanism that divides process memory into two parts: a safe
memory region and a regular memory region. Through static
analysis, memory objects that contain pointers, including
code and data pointers, are put into the safe memory region
for protection against illegal tampering. However, flaws in
the CPI approach have been pointed out [14], because its safe
memory region is not well protected. The essential idea of
LMP is likewise to guard the memory region where the shadow
stacks are located. We use a new hardware feature for fast
memory bound checking to ensure that the allocated shadow
stack region is protected effectively and efficiently.


Other hardware-based CFI approaches have recently been
proposed, e.g., HCFI [7] and HAFIX [10], whose systems run on
a customized FPGA board or a SPARC embedded system. In
comparison, LMP is the first system with hardware-assisted
memory protection that is compatible with commercially
available CPUs and other hardware. CET (Control-Flow
Enforcement Technology) [20] was announced in a preview
version in June 2016. The technology introduces a new
exception class (#CP) with interrupt vector 21, new ENDBRANCH
instructions added to the ISA to help mark legal targets for
indirect branches or jumps, and an officially defined shadow
stack for all control transfer operations. In the CET design,
the shadow stack is protected by a page table mechanism that
does not allow regular store instructions to modify the
shadow stack, so additional attributes are necessary for
shadow stack pages. CET provides a different way to protect
the shadow stack from being tampered with. However, its
overhead and cost are still unknown because the work is not
yet complete; once more details about the hardware and
software are released, it can be evaluated and compared with
the LMP system.

7. CONCLUSION 

Memory protection is a keystone of all defense techniques
against memory corruption attacks. Without properly
protecting the shadow stack, CFI approaches cannot
effectively stop ROP attackers and have been proven to be
insecure in general. Our work proposes a light-weighted
memory protection system to protect the critical memory
region that stores the return addresses of function call
stacks, namely the shadow stacks. Leveraging the recently
available MPX hardware features, our approach achieves low
overhead in ensuring that only legal accesses to the
protected region are allowed, so that return addresses cannot
be tampered with by an attacker. For future work, we will
complete the LMP protection on forward edges and explore the
possibility of applying LMP without requiring recompilation
of the program, for example, by using binary rewriting to
perform the shadow stack functions for protection.